feat(evaluation): offline evaluation module with uv run evaluate CLI#280
Implement src/evaluation/: it consumes saved agent trajectories
({run_id}.json under AGENT_TRAJECTORY_DIR) and scenario files, joins
them on scenario_id, runs a registered grader per scenario, and emits
a JSON report combining grading results with operational metrics
(tokens, duration p50/p95, tool calls, optional cost estimate).
The shape follows SWE-bench / HELM / τ-bench conventions: agent run
→ evaluate → report.json, with offline re-grading from saved
trajectories as a first-class workflow.
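
From the description, the runner's core is a join-then-grade loop over saved files. A minimal sketch of that loop follows; it is not the actual module code, and the field names (`scenario_id`, `grading_method`, `expected`, `final_answer`) and the plain `graders` dict are assumptions:

```python
import json
from pathlib import Path


def evaluate(trajectory_dir: Path, scenario_file: Path, out: Path, graders: dict) -> None:
    """Join saved trajectories with scenarios on scenario_id, grade, and report.

    Sketch only: all field names here are assumptions, not the module's schema.
    """
    scenarios = json.loads(scenario_file.read_text())

    trajectories = {}
    for path in trajectory_dir.glob("*.json"):  # one {run_id}.json per agent run
        traj = json.loads(path.read_text())
        trajectories[traj["scenario_id"]] = traj

    results = []
    for sc in scenarios:
        traj = trajectories.get(sc["scenario_id"])
        if traj is None:
            continue  # the real runner presumably records missing runs instead
        grader = graders[sc.get("grading_method", "exact_string_match")]
        results.append({
            "scenario_id": sc["scenario_id"],
            "grade": grader(traj.get("final_answer", ""), sc.get("expected", "")),
        })

    out.write_text(json.dumps({"results": results}, indent=2, default=str))
```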
Includes:
- Pydantic models (Scenario, PersistedTrajectory, GradeResult, OpsMetrics, EvalReport)
- Loader for trajectory dirs and JSON/JSONL scenario files
- Grader registry with two deterministic graders (exact_string_match, numeric_match) and a pluggable LLM judge bound to LLMBackend (six-criterion rubric); see the registry sketch after this list
- Per-task ops metric extraction (handles both SDK Trajectory and plan-execute list[StepResult] shapes) plus aggregate rollups; see the metrics sketch after this list
- Report writer with terminal summary and JSON output
- evaluate script registered in [project.scripts]
- 39 unit tests covering models, loader, graders, metrics, report, and end-to-end runner, all passing alongside the existing 270 tests
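
The grader registry named above might look like the following minimal sketch. The decorator-based registration and the GradeResult fields are assumptions, not the module's actual API:

```python
import math
from typing import Callable

from pydantic import BaseModel


class GradeResult(BaseModel):
    # Illustrative fields; the real GradeResult model may differ.
    passed: bool
    score: float
    method: str


GRADERS: dict[str, Callable[..., GradeResult]] = {}


def register_grader(name: str):
    """Decorator that registers a grader under a grading_method name."""
    def wrap(fn):
        GRADERS[name] = fn
        return fn
    return wrap


@register_grader("exact_string_match")
def exact_string_match(answer: str, expected: str) -> GradeResult:
    ok = answer.strip() == expected.strip()
    return GradeResult(passed=ok, score=float(ok), method="exact_string_match")


@register_grader("numeric_match")
def numeric_match(answer: str, expected: str, rel_tol: float = 1e-6) -> GradeResult:
    try:
        ok = math.isclose(float(answer), float(expected), rel_tol=rel_tol)
    except ValueError:
        ok = False
    return GradeResult(passed=ok, score=float(ok), method="numeric_match")
```

With this shape, the runner resolves a scenario's grading_method by a plain dict lookup, and adding a grader is one decorator away.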
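Per-task metric extraction has to normalize the two persisted shapes the description mentions (an SDK Trajectory serialized as a dict versus a plan-execute list[StepResult]). A sketch, with every step key (`tokens`, `duration_s`, `tool_name`) an assumption:

```python
from statistics import median, quantiles


def extract_ops_metrics(payload) -> dict:
    """Normalize either persisted shape into one metrics dict (sketch only)."""
    if isinstance(payload, dict):  # SDK Trajectory serialized as a dict
        steps = payload.get("steps", [])
    else:  # plan-execute list[StepResult]; StepResult assumed dict-convertible
        steps = [s if isinstance(s, dict) else vars(s) for s in payload]

    durations = sorted(s.get("duration_s", 0.0) for s in steps)
    if len(durations) >= 2:
        # quantiles(n=20) yields 19 cut points; the last one is the p95.
        p50, p95 = median(durations), quantiles(durations, n=20)[-1]
    else:
        p50 = p95 = durations[0] if durations else 0.0

    return {
        "total_tokens": sum(s.get("tokens", 0) for s in steps),
        "tool_calls": sum(1 for s in steps if s.get("tool_name")),
        "duration_p50": p50,
        "duration_p95": p95,
    }
```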
Closes #279
Signed-off-by: Shuxin Lin <linshuhsin@gmail.com>
Collaborator:
Please follow the concepts at https://mlflow.org/docs/latest/genai/concepts/scorers/ and prefer to use Scorer.
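
If the graders were migrated to MLflow's Scorer concept as suggested, wrapping one deterministic grader might look like the sketch below. The @scorer decorator follows the linked docs, but treat the exact import path, parameter names, and expectations key as assumptions against your MLflow version:

```python
from mlflow.genai.scorers import scorer


@scorer
def exact_string_match(outputs: str, expectations: dict) -> bool:
    # Same check as the deterministic grader above, in Scorer form;
    # the "expected" key inside expectations is an assumption.
    return outputs.strip() == str(expectations.get("expected", "")).strip()
```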
Summary
- src/evaluation/ module: load saved agent trajectories + scenarios → grade → emit a JSON report
- exact_string_match, numeric_match (deterministic), and a pluggable LLM judge with a six-criterion rubric
- uv run evaluate CLI registered in pyproject.toml; see the snippet after this list
- Layout follows the three-stage run → evaluate → report pattern used by SWE-bench, HELM, and τ-bench; re-grading from saved trajectories is first-class, with no need to re-run the agent
- Closes #279
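
The CLI registration mentioned above is a single [project.scripts] entry; the module path shown here is an assumption:

```toml
[project.scripts]
evaluate = "evaluation.cli:main"  # actual entry-point path may differ
```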
Test plan
- uv run pytest src/evaluation/ -v: 39 passed
- uv run pytest src/ -v -k "not integration": 309 passed (no regressions)
- uv run evaluate --trajectories <dir> --scenarios <file> --output report.json --grader-default exact_string_match produced the expected report (example shape below)
- Verified that a per-scenario grading_method overrides --grader-default
- Verified metric extraction on both SDK Trajectory dict and plan-execute list[StepResult] shapes
- Exercised the LLM judge with --judge-model litellm_proxy/anthropic/claude-opus-4-5 against a real LiteLLM proxy on a small batch
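
For orientation, the report produced by the command above would have roughly this shape. Every field and value here is illustrative, inferred from the model names in the description (GradeResult, OpsMetrics, EvalReport) rather than copied from the code:

```json
{
  "pass_rate": 0.5,
  "results": [
    {
      "scenario_id": "calc-001",
      "grade": {"passed": true, "score": 1.0, "method": "exact_string_match"},
      "ops": {"total_tokens": 1842, "tool_calls": 3, "duration_p50": 0.42, "duration_p95": 1.9}
    }
  ]
}
```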